Statistical Disclosure Control
Contact: Peter-Paul de Wolf
Statistics Netherlands
P.O. Box 24500
2490 HA The Hague
The Netherlands
Phone: +31 70 337 5060

Last update: 10 Oct 2011
Microdata: new masking methodology (WP1.1)

Leading partner: URV
Participating partners: Istat, UoP, StBa, CISC, URV

This workpackage is devoted to the research and development of new statistical disclosure control (SDC) methods to be included in the new release of the µ-ARGUS SDC package. Different approaches to microdata protection, most of them based on masking, will be explored and tested in the various tasks of this workpackage. For each approach a prototype implementation will be produced that can be taken as input by WP 2 (Microdata: software development) for integration into µ-ARGUS and by WP 5 (Methodology testing). The objectives of the workpackage, broken down by tasks, are as follows:

Task T1 (responsible: Istat and UoP)

Objectives

To build a new framework for statistical disclosure control for business microdata, different from the usual framework based on matrix masking methodology. To design a matching algorithm to check the effectiveness of the proposed methodology. To improve µ-ARGUS by providing implementations of the new methods defined in this task that can be integrated into the package.

Description of the work

We propose to develop a methodology for statistical disclosure control for business microdata based on
model estimates. A statistical model for quantitative variables that takes account of the geographical
area to which each enterprise belongs will be built. We propose limiting disclosure of sensitive
quantitative variables by releasing predictive intervals or other summaries of the predictive density
associated with the model. The estimates of area effects from the model will suggest a broader
categorisation to use when releasing the variable geographical area that goes a long way to minimising
information loss. In order to check that the suggested protection measures are indeed sufficient we will
develop a specially designed matching algorithm. Software to implement the methodology proposed in this
task will be developed using S-Plus or SAS, so that the statistical and graphical capabilities of those
packages can be utilised.
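As a rough illustration of the idea of releasing predictive intervals rather than raw values (not the model this task will actually develop), the following numpy sketch fits a simple per-area normal model to the log of a sensitive variable and releases a 90% predictive interval per record instead of the value itself. The variable names, the per-area normality assumption and the 90% level are all illustrative.

```python
# Hedged sketch: release predictive intervals instead of sensitive values.
# The per-area normal model is an illustrative assumption, not the task's model.
import numpy as np

def predictive_intervals(values, areas, z=1.6449):
    """For each record, return a (lo, hi) ~90% predictive interval from a
    simple normal model fitted within the record's geographical area."""
    out = []
    for v, a in zip(values, areas):
        peers = values[areas == a]          # all enterprises in the same area
        mu, sd = peers.mean(), peers.std(ddof=1)
        out.append((mu - z * sd, mu + z * sd))
    return out

rng = np.random.default_rng(0)
areas = np.repeat(np.arange(3), 50)                  # 3 areas, 50 enterprises each
values = rng.lognormal(mean=areas + 5.0, sigma=0.4)  # sensitive quantitative variable
protected = predictive_intervals(np.log(values), areas)
```

The released intervals depend only on the area-level fit, so two enterprises in the same area receive the same interval; a real implementation would base the interval on the full predictive density of the fitted model.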
Milestones and expected results

Task T2 (responsible: StBa)

Objectives

The aim of this task is to improve the masking algorithm designed by Sullivan (see references in the Description of the work below) in order to make it applicable in practice. Extensions to the algorithm will have to be implemented that lead to useful results for complex data structures. Furthermore, it is necessary to develop a modified algorithm that can be used for partial masks. The former is necessary for combining masking with other SDC techniques, and the latter makes it possible to reflect dependencies, in particular filters in questionnaires.

Description of the work

As a first step, the structure of Sullivan's masking algorithm (Sullivan 1989, Fuller 1993) will be extended to accommodate partial masks and fixed sets of values. Partial masking can be integrated by imposing linear restrictions during the masking and the iterative correction procedures. Fixed dependencies will be incorporated in a similar way, via other kinds of linear restrictions. For instance, we intend to test whether masking is effective (with respect to an internal distance criterion) if the values of the 'dependent' variables are fixed, because this strategy would reduce the computing time.

The second step is the development of a revised data set from some business statistics (e.g. VAT statistics, cost-structure statistics) for a first application of the algorithm. This revision will draw on the experience of statistical experts who have been working with the data. For example, variables that are rarely used should be excluded and rarely occurring categories should be collapsed. This procedure can be viewed as initial work towards the development of a scientific-use file.

The third step is to apply the masking algorithm to the revised data set. Analyses with standard techniques are only valid if the whole masked (sub-)sample is included or if some separately masked sub-samples are analysed together. That is why well-defined subsamples should be masked.

References
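The partial-mask idea described above, in its simplest form, can be sketched as follows. This is not Sullivan's algorithm: it only shows noise being added to selected columns while 'dependent' columns stay fixed, with a single linear restriction (mean preservation) in place of the iterative correction procedures. The column choice and noise scale are illustrative.

```python
# Hedged sketch of a partial mask: noise on selected columns only, with a
# linear restriction (exact mean preservation). Not Sullivan's algorithm.
import numpy as np

def partial_mask(X, mask_cols, noise_scale=0.1, rng=None):
    """Add Gaussian noise to the columns in `mask_cols`, leaving the remaining
    ('dependent'/filter) columns fixed, and re-centre each masked column so
    its mean is preserved exactly."""
    if rng is None:
        rng = np.random.default_rng()
    Xm = X.copy()
    for j in mask_cols:
        sd = X[:, j].std(ddof=1)
        noisy = X[:, j] + rng.normal(0.0, noise_scale * sd, size=len(X))
        # linear restriction: masked column keeps the original column mean
        Xm[:, j] = noisy - noisy.mean() + X[:, j].mean()
    return Xm

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
Xm = partial_mask(X, mask_cols=[0, 1], rng=rng)
```

A full implementation would impose further linear restrictions (e.g. on covariances and on the fixed dependencies between variables) through the correction step.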
Milestones and expected results

Task T3 (responsible: URV)

Objectives

Microaggregation is the most widely used technique for protecting quantitative microdata. The main objective of this task is to move beyond the current state of the art of microaggregation methods by developing advanced algorithms, namely data-oriented microaggregation and microaggregation of unprojected multidimensional data. A second objective is to provide C/C++ implementations of the new algorithms that can be included in µ-ARGUS. A third objective is to characterise the computational complexity of exact optimal microaggregation, which will provide a theoretical justification for the use of heuristic methods. A final objective is to compare the performance of the new algorithms developed in this task against those developed in T2.

Description of the work

This task will move beyond the current state of the art in microaggregation, which includes individual ranking, single-axis methods, weighted moving averages and multivariate methods (Defays and Nanopoulos, 1993; Mateo and Domingo, 1999); in all cases, both fixed-size and variable-size groups can be considered and, for each approach, there is a tradeoff between information loss (data utility) and confidentiality protection (data safety). Current users of microaggregation, such as Eurostat and others (e.g. Corsini et al., 1999; Nechaeva and Sokolov, 1996), will benefit from the output of this task, whose main aim is to develop new algorithms for advanced microaggregation. Efficient algorithms are known for single-axis, individual ranking and weighted moving average methods as long as the group size is kept fixed. The following subtasks will be tackled: a) development of new algorithms for the variable-group-size versions of known microaggregation approaches; b) development of new algorithms for multivariate microaggregation of unprojected data. Programs to implement all microaggregation algorithms developed will be written.
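To make the starting point concrete, here is a minimal sketch of fixed-size microaggregation by individual ranking, one of the existing approaches this task builds on: each variable is sorted independently, records are grouped k at a time, and each value is replaced by its group mean.

```python
# Minimal sketch of fixed-size individual-ranking microaggregation.
import numpy as np

def microaggregate_individual_ranking(X, k=3):
    """Microaggregate each column of X independently with fixed group size k;
    a remainder group smaller than k is merged into the last full group."""
    Xa = np.asarray(X, dtype=float).copy()
    n = Xa.shape[0]
    for j in range(Xa.shape[1]):
        order = np.argsort(Xa[:, j])                     # ranks for this column
        groups = [order[i:i + k] for i in range(0, n, k)]
        if len(groups) > 1 and len(groups[-1]) < k:      # absorb the remainder
            groups[-2] = np.concatenate([groups[-2], groups[-1]])
            groups.pop()
        for g in groups:
            Xa[g, j] = Xa[g, j].mean()                   # replace by group mean
    return Xa

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
Xk = microaggregate_individual_ranking(X, k=3)
```

After aggregation each value appears at least k times in its column, so no record is unique on any single microaggregated variable, and column means are preserved exactly; this is the fixed-size baseline against which the task's variable-size and unprojected multivariate algorithms would be compared.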
Another important issue to be dealt with is the justification of the use of heuristics for microaggregation. The idea is to characterise the computational complexity of exact optimal microaggregation, which is conjectured to be NP-hard. Microaggregation is a special case of record masking; therefore, this task would not be complete without a comparison with other masking methods that will be included in µ-ARGUS (e.g. methods developed under task T2 of this workpackage). The comparison will be in terms of information loss and safety achieved. Finally, we expect to disseminate the results obtained in a number of high-quality scientific publications.

References

Corsini, V., Franconi, L., Pagliuca, D., and Seri, G. (1999). An application of microaggregation methods to Italian business surveys. In: Statistical Data Protection'98, Luxembourg: OPOCE, pp. 109-113.

Milestones and expected results

- New microaggregation algorithms to improve on the current ones and on other masking techniques with regard to information loss and disclosure risk.

Task T4 (responsible: CISC)

Objectives

µ-ARGUS currently offers two SDC techniques for protecting categorical microdata: global recoding and local suppression. Neither technique uses any information about the categories (or the domain) of a given variable. If a variable is known to have as its domain a set of ordered categories (linguistic terms), microaggregation is a feasible alternative. A possible approach is to microaggregate by first translating the ordered categories into a numerical scale. Such a translation corresponds to an implicit settlement of the semantics of the ordered categories. The aim of this task is to extend µ-ARGUS with mechanisms to make this semantics explicit and to provide the corresponding aggregation tools for categorical microdata.

Description of the work

Our approach to defining methods for qualitative aggregation is based on a two-stage procedure: (i) semantics determination and (ii) aggregation function selection. In the semantics determination stage, we plan to develop a set of tools to describe the semantics of linguistic labels. This semantics will be represented as metadata of the tables. Three types of semantics will be considered: a) explicit interval selection (Moore, 1966), b) explicit fuzzy interval selection (Klir and Yuan, 1995), and c) implicit selection from pairs of antonyms (Valls and Torra, 1999). Successful completion of this stage requires either that the reader of µ-ARGUS be modified (so that some information is included in the metafile), or that an extra metafile be included, or that the information be introduced by the user in an ad-hoc menu. In the aggregation function selection stage, we plan to develop a set of microaggregation functions for qualitative values on the basis of the three models of semantics description mentioned above.
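The two-stage procedure can be sketched as follows, under the simplest possible semantics: each ordered label is mapped to a point on a numerical scale (a degenerate case of explicit interval selection), records are microaggregated on that scale, and each group is replaced by the label nearest to its mean. The label set and scale values are illustrative, not µ-ARGUS metadata.

```python
# Hedged sketch of qualitative microaggregation: (i) fix a point semantics for
# ordered labels, (ii) microaggregate on the scale and map back to labels.
LABELS = ["very low", "low", "medium", "high", "very high"]
SCALE = {lab: i / (len(LABELS) - 1) for i, lab in enumerate(LABELS)}  # 0.0 .. 1.0

def microaggregate_labels(records, k=3):
    """Sort records by their numeric semantics, group k at a time, and replace
    each group by the label closest to the group's mean scale value.
    (A real implementation would merge a remainder group smaller than k.)"""
    order = sorted(range(len(records)), key=lambda i: SCALE[records[i]])
    out = list(records)
    for start in range(0, len(order), k):
        grp = order[start:start + k]
        mean = sum(SCALE[records[i]] for i in grp) / len(grp)
        rep = min(LABELS, key=lambda lab: abs(SCALE[lab] - mean))
        for i in grp:
            out[i] = rep
    return out

data = ["low", "high", "medium", "very low", "high", "medium",
        "low", "very high", "medium"]
protected = microaggregate_labels(data, k=3)
```

The interval and fuzzy-interval semantics of the task would replace the point scale by intervals or fuzzy sets, with the aggregation function chosen accordingly.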
We expect to disseminate the results obtained in this task in high-quality scientific publications.

References

Klir, G., and Yuan, B. (1995). Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice-Hall, U.K.

Milestones and expected results

- Development of aggregation functions based on three different semantics.